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Abstract 



In recent years, a rich variety of shrinkage priors have been proposed that have 
great promise in addressing massive regression problems. In general, these new 
priors can be expressed as scale mixtures of normals, but have more complex 
forms and better properties than traditional Cauchy and double exponential priors. 
i— i We first propose a new class of normal scale mixtures through a novel general- 

PJ ized beta distribution that encompasses many interesting priors as special cases. 

This encompassing framework should prove useful in comparing competing pri- 
^J ors, considering properties and revealing close connections. We then develop a 

class of variational Bayes approximations through the new hierarchy presented 
that will scale more efficiently to the types of truly massive data sets that are now 
encountered routinely. 



> 1 Introduction 

Penalized likelihood estimation has evolved into a major area of research, with £\\T1\ and other 
regularization penalties now used routinely in a rich variety of domains. Often minimizing a loss 
function subject to a regularization penalty leads to an estimator that has a Bayesian interpretation as 
the mode of a posterior distribution [8] HU HI 12 1, w ^ different prior distributions inducing different 
penalties. For example, it is well known that Gaussian priors induce £2 penalties, while double ex- 
ponential priors induce £1 penalties [8, 19] SUDD- Viewing massive-dimensional parameter learning 
and prediction problems from a Bayesian perspective naturally leads one to design new priors that 
^>- have substantial advantages over the simple normal or double exponential choices and that induce 

rich new families of penalties. For example, in high-dimensional settings it is often appealing to 
have a prior that is concentrated at zero, favoring strong shrinkage of small signals and potentially a 
sparse estimator, while having heavy tails to avoid over-shrinkage of the larger signals. The Gaus- 
sian and double exponential priors are insufficiently flexible in having a single scale parameter and 
relatively light tails; in order to shrink many small signals strongly towards zero, the double expo- 
nential must be concentrated near zero and hence will over-shrink signals not close to zero. This 
phenomenon has motivated a rich variety of new priors such as the normal-exponential-gamma, the 
horseshoe and the generalized double Pareto ifTTl [141 [Tl l6l l20l 171 [T2l l2l . 

An alternative and widely applied Bayesian framework relies on variable selection priors and 
Bayesian model selection/averaging lfl"8l l9l [161 [151 . Under such approaches the prior is a mix- 
ture of a mass at zero, corresponding to the coefficients to be set equal to zero and hence excluded 
from the model, and a continuous distribution, providing a prior for the size of the non-zero signals. 
This paradigm is very appealing in fully accounting for uncertainty in parameter learning and the 
unknown sparsity structure through a probabilistic framework. One obtains a posterior distribution 
over the model space corresponding to all possible subsets of predictors, and one can use this pos- 
terior for model-averaged predictions that take into account uncertainty in subset selection and to 
obtain marginal inclusion probabilities for each predictor providing a weight of evidence that a spe- 
cific signal is non-zero allowing for uncertainty in the other signals to be included. Unfortunately, 



the computational complexity is exponential in the number of candidate predictors (2 P with p the 
number of predictors). 

Some recently proposed continuous shrinkage priors may be considered competitors to the con- 
ventional mixture priors fT5l [6] [7] Q2) yielding computationally attractive alternatives to Bayesian 
model averaging. Continuous shrinkage priors lead to several advantages. The ones represented as 
scale mixtures of Gaussians allow conjugate block updating of the regression coefficients in linear 
models and hence lead to substantial improvements in Markov chain Monte Carlo (MCMC) effi- 
ciency through more rapid mixing and convergence rates. Under certain conditions these will also 
yield sparse estimates, if desired, via maximum a posteriori (MAP) estimation and approximate 
inferences via variational approaches lTT7l l24l 151 IS1 [TT1 HI l2l . 

The class of priors that we consider in this paper encompasses many interesting priors as special 
cases and reveals interesting connections among different hierarchical formulations. Exploiting 
an equivalent conjugate hierarchy of this class of priors, we develop a class of variational Bayes 
approximations that can scale up to truly massive data sets. This conjugate hierarchy also allows 
for conjugate modeling of some previously proposed priors which have some rather complex yet 
advantageous forms and facilitates straightforward computation via Gibbs sampling. We also argue 
intuitively that by adjusting a global shrinkage parameter that controls the overall sparsity level, we 
may control the number of non-zero parameters to be estimated, enhancing results, if there is an 
underlying sparse structure. This global shrinkage parameter is inherent to the structure of the priors 
we discuss as in (6l[7) with close connections to the conventional variable selection priors. 

2 Background 

We provide a brief background on shrinkage priors focusing primarily on the priors studied by EH 
and ifTTl [T2l as well as the Strawderman-Berger (SB) prior Q. These priors possess some very 
appealing properties in contrast to the double exponential prior which leads to the Bayesian lasso |[T9l 
[T3l . They may be much heavier-tailed, biasing large signals less drastically while shrinking noise- 
like signals heavily towards zero. In particular, the priors by (6] 13, along with the Strawderman- 
Berger prior [7 1, have a very interesting and intuitive representation later given in (ph, yet, are not 
formed in a conjugate manner potentially leading to analytical and computational complexity. 

|6][7) propose a useful class of priors for the estimation of multiple means. Suppose a p-dimensional 
vector y\0 ~ J\f(0, 1) is observed. The independent hierarchical prior for Oj is given by 

B^Ts ~ N$,Ti), r 3 1/2 ~C+(0,^/ 2 ), (1) 

for j = 1, . . . ,p, where Af(p, v) denotes a normal distribution with mean p and variance v and 
C + (0, s) denotes a half-Cauchy distribution on 5R + with scale parameter s. With an appropriate 
transformation pj = 1/(1 + Tj), this hierarchy also can be represented as 

9i\ Pi ~ Af(0, 1/ Pj - 1), 7r( ft |0) ex P - 1/2 (1 - Pj )-^ —- T l . (2) 

1 + \<P ~ l)Pj 

A special case where 0=1 leads to pj ~ B(l/2, 1/2) (beta distribution) where the name of the prior 
arises, horseshoe (HS) EFT). Here pjS are referred to as the shrinkage coefficients as they determine 
the magnitude with which 9jS are pulled toward zero. A prior of the form pj ~ B(l/2, 1/2) is 
natural to consider in the estimation of a signal Oj as this yields a very desirable behavior both at 
the tails and in the neighborhood of zero. That is, the resulting prior has heavy-tails as well as being 
unbounded at zero which creates a strong pull towards zero for those values close to zero. [7] further 
discuss priors of the form pj ~ B(a, b) for a > 0, b > to elaborate more on their focus on the 
choice a = b = 1/2. A similar formulation dates back to ED . Q refer to the prior of the form 
Pj ~ B(l, 1/2) as the Strawderman-Berger prior due to [21] and [4|. The same hierarchical prior 
is also referred to as the quasi-Cauchy prior in |[T6l . Hence, the tail behavior of the Strawderman- 
Berger prior remains similar to the horseshoe (when = 1), while the behavior around the origin 
changes. The hierarchy in ([2} is much more intuitive than the one in ([1} as it explicitly reveals the 
behavior of the resulting marginal prior on Oj. This intuitive representation makes these hierarchical 
priors interesting despite their relatively complex forms. On the other hand, what the prior in (fill or 
d2| lacks is a more trivial hierarchy that yields recognizable conditional posteriors in linear models. 



ifTTl [121 consider the normal-exponential-gamma (NEG) and normal-gamma (NG) priors respec- 
tively which are formed in a conjugate manner yet lack the intuition the Strawderman-Berger and 
horseshoe priors provide in terms of the behavior of the density around the origin and at the tails. 
Hence the implementation of these priors may be more user-friendly but they are very implicit in 
how they behave. In what follows we will see that these two forms are not far from one another. 
In fact, we may unite these two distinct hierarchical formulations under the same class of priors 
through a generalized beta distribution and the proposed equivalence of hierarchies in the following 
section. This is rather important to be able to compare the behavior of priors emerging from different 
hierarchical formulations. Furthermore, this equivalence in the hierarchies will allow for a straight- 
forward Gibbs sampling update in posterior inference, as well as making variational approximations 
possible in linear models. 

3 Equivalence of Hierarchies via a Generalized Beta Distribution 

In this section we propose a generalization of the beta distribution to form a flexible class of scale 
mixtures of normals with very appealing behavior. We then formulate our hierarchical prior in a 
conjugate manner and reveal similarities and connections to the priors given in 1 16l [TT1 [121 l6l 171 . 
As the name generalized beta has previously been used, we refer to our generalization as the three- 
parameter beta (TPB) distribution. 

In the forthcoming text T(.) denotes the gamma function, G(n, v) denotes a gamma distri- 
bution with shape and rate parameters \i and v, W(f, S) denotes a Wishart distribution with 
v degrees of freedom and scale matrix S, ^(0:1,0:2) denotes a uniform distribution over 
(01,0:2), QIQ(p,,i/,£) denotes a generalized inverse Gaussian distribution with density function 
(v/iY/ 2 {2'K^Ti)}- l xi 1 - 1 exp{(>x + £/x)/2}, and K M (.) is a modified Bessel function of the 
second kind. 

Definition 1. The three-parameter beta (TPB) distribution for a random variable X is defined by 
the density function 

f(x; a, b, cb) = ££ + £Lv-i(l - *)-! {1 + (0 - l)x}" (a+6) , (3) 

r(a)r(6) 

for < x < 1, a > 0, b > and (j) > and is denoted by TVB(a, b, (f>). 

It can be easily shown by a change of variable x = l/(y + 1) that the above density integrates to 1. 
The fcth moment of the TPB distribution is given by 

where 2F1 denotes the hypergeometric function. In fact it can be shown that TPB is a subclass of 
Gauss hypergeometric (GH) distribution proposed in Q and the compound confluent hypergeomet- 
ric (CCH) distribution proposed in ifTUll . 



The density functions of GH and CCH distributions are given by 

f ( h n x"- 1 (l-x) a - 1 (l + Cx)-' r ... 

/ GH (x; a, b, r, Q = — — — , (5) 

B(6, a) 2 Fi(r,6;a + 6;-g 

b„b-l(-\ „\a-l 
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1 (l-x) a - 1 (6»+(l- 



/cChO^A^'M) = uf , , -, 1 ,^ 7 -, — 7—. a\, (6) 

B(o, a) exp(— s/^)"t>i(a, r,a + 0, s/v, 1 — 0) 

for < x < 1 and < x < 1/v, respectively, where B(6, a) — T(a)T(b)/T(a + b) denotes the beta 
function and $1 is the degenerate hypergeometric function of two variables [10|. Letting £ = 0—1, 
r = a + b and noting that 2F1 (a + b, b; a + b; 1 — (/>) = <f>~ b , (pll becomes a TPB density. Also note 
that ^ becomes Q for s = 1, v = 1 and C = (1 - 0)/9 iflOl . 

EOl considered an alternative special case of the CCH distribution for the shrinkage coefficients, 
Pj, by letting v = r = 1 in d6l). [20 1 refer to this special case as the hypergeometric-beta (HB) 
distribution. TPB and HB generalize the beta distribution in two distinct directions, with one prac- 
tical advantage of the TPB being that it allows for a straightforward conjugate hierarchy leading to 
potentially substantial analytical and computational gains. 



Now we move onto the hierarchical modeling of a flexible class of shrinkage priors for the estimation 
of a potentially sparse p-vector. Suppose a p-dimensional vector y\0 ~ J\f(6, 1) is observed where 
9 = (9i , . . . , Op)' is of interest. Now we define a shrinkage prior that is obtained by mixing a normal 
distribution over its scale parameter with the TPB distribution. 

Definition 2. The TPB normal scale mixture representation for the distribution of random variable 
9j is given by 

Oj\ Pi ~N%llpj-l), Pj ~TTB(a,b,4), (7) 

where a > 0, b > and 4> > 0. The resulting marginal distribution on 9j is denoted by 

TVBAf(a,b,(f>). 

Figure [Tlillustrates the density on pj for varying values of a, b and cf>. Note that the special case for 
a = b = 1/2 in Figure pTa) gives the horseshoe prior. Also when a = <p = 1 and b = 1/2, this 
representation yields the Strawderman-Berger prior. For a fixed value of (f>, smaller a values yield 
a density on 9j that is more peaked at zero, while smaller values of b yield a density on 6j that is 
heavier tailed. For fixed values of a and b, decreasing <f> shifts the mass of the density on pj from 
left to right, suggesting more support for stronger shrinkage. That said, the density assigned in the 
neighborhood of 8j = increases while making the overall density lighter-tailed. We next propose 
the equivalence of three hierarchical representations revealing a wide class of priors encompassing 
many of those mentioned earlier. 

Proposition 1. 7/0, ~ TVBJ\f(a, b, <j>), then 

1) 6j ~ Af(0, Tj), tj ~ G(a, Xj) and Xj ~ G(b, 4>). 

2) 0j ~ W(0, tj), n(Tj) = r£^j <f>- a T a - 1 (l + T i /0)-( Q+fc ) which implies that t /<j> - /3'(a, b), 
the inverted beta distribution with parameters a and b. 

The equivalence given in Proposition 1 is significant as it makes the work in Section 4 possible 
under the TPB normal scale mixtures as well as further revealing connections among previously 
proposed shrinkage priors. It provides a rich class of priors leading to great flexibility in terms of 
the induced shrinkage and makes it clear that this new class of priors can be considered simultaneous 
extensions to the work by lfTTl[T2"l and EQ. It is worth mentioning that the hierarchical prior(s) 
given in Proposition 1 are different than the approach taken by |12| in how we handle the mixing. 
In particular, the first hierarchy presented in Proposition 1 is identical to the NG prior up to the first 
stage mixing. While fixing the values of a and b, we further mix over A^ (rather than a global A) 
and further over <f> if desired as will be discussed later. <f> acts as a global shrinkage parameter in 
the hierarchy. On the other hand, lfT2l choose to further mix over a and a global A while fixing the 
values of b and <fr. By doing so, they forfeit a complete conjugate structure and an explicit control 
over the tail behavior of k{9j). 

As a direct corollary to Proposition 1, we observe a possible equivalence between the SB and the 
NEG priors. 

Corollary 1. If a = 1 in Proposition 1, then TPBN = NEG. If (a, b, cj>) — (1, 1/2, 1) in Proposition 
1, then TPBN = SB = NEG. 

An interesting, yet expected, observation on Proposition 1 is that a half-Cauchy prior can be repre- 
sented as a scale mixture of gamma distributions, i.e. if tj ~ 5(1/2, Xj) and Xj ~ 5(1/2, </>), then 

Tj ~ C + (0, 1 / 2 ). This makes sense as t x / 2 \Xj has a half-Normal distribution and the mixing 
distribution on the precision parameter is gamma with shape parameter 1/2. 

Q further place a half-Cauchy prior on cj) 1 / 2 to complete the hierarchy. The aforementioned obser- 
vation helps us formulate the complete hierarchy proposed in (7] in a conjugate manner. This should 
bring analytical and computational advantages as well as making the application of the procedure 
much easier for the average user without the need for a relatively more complex sampling scheme. 

Corollary 2. If 9j - M(0,Tj), t] /2 - C+(O,0 1/2 ) and 1/2 - C+(0,1), then 6j ~ 

TVBAf (1/2, 1/2, <j)), (/) - 5(1/2, u) and uj - 5(1/2, 1). 

Hence disregarding the different treatments of the higher-level hyper-parameters, we have shown 
that the class of priors given in Definition 2 unites the priors in lfl6l ITTl [6] 13 under one family and 
reveals their close connections through the equivalence of hierarchies given in Proposition 1 . The 
first hierarchy in Proposition 1 makes much of the work possible in the following sections. 
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Figure 1: (a, b) = {(1/2, 1/2), (1, 1/2), (1, 1), (1/2, 2), (2, 2), (5, 2)} for (a)-(f) respectively. = 
{1/10, 1/9, 1/8, 1/7, 1/6, 1/5, 1/4, 1/3, 1/2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10} considered for all pairs of a 
and b. The line corresponding to the lowest value of <j> is drawn with a dashed line. 



4 Estimation and Posterior Inference in Regression Models 



4.1 Fully Bayes and Approximate Inference 



Consider the linear regression model, y = X/3 + e, where y is an n-dimensional vector of responses, 
X is the n x p design matrix and e is an n-dimensional vector of independent residuals which are 
normally distributed, Af(0, c 2 I„) with variance a 2 . 

We place the hierarchical prior given in Proposition 1 on each j3j, i.e. j3j ~ A/"(0, o Tj), 
tj ~ G{a, Aj), Xj ~ C7(fo, 0). is used as a global shrinkage parameter common to all j3j, and may 
be inferred using the data. Thus we follow the hierarchy by letting <fi ~ £7(1/2, to), u) ~ 0(1/2, 1) 
which implies 1 / 2 ~ C + (0, 1) that is identical to what was used in Q at this level of the hierarchy. 
However, we do not believe at this level in the hierarchy the choice of the prior will have a huge 
impact on the results. Although treating <\> as unknown may be reasonable, when there exists some 
prior knowledge, it is appropriate to fix a cf) value to reflect our prior belief in terms of underlying 
sparsity of the coefficient vector. This sounds rather natural as soon as one starts seeing as a pa- 
rameter that governs the multiplicity adjustment as discussed in |7|. Note also that here we form the 
dependence on the error variance at a lower level of hierarchy rather than forming it in the prior of 
as done in [7]. If we let a = b = 1/2, we will have formulated the hierarchical prior given in [7] 
in a completely conjugate manner. We also let <r~ 2 ~ Q(cq/2, do/2). Under a normal likelihood, 
an efficient Gibbs sampler may be obtained as the fully conditional posteriors can be extracted: 
/3|y,X,cr 2 ,Ti,...,T p - MiUfrVp), cr- 2 |y,X,/3,Ti,...,T p - G(c*,d*), tj\^,g 2 ,\j ~ 
giG(a~l/2,2X 3 ,f3]/a 2 ), \j\tj,<j>~ g(a+b,T J +<f>), <f>\\j,w ~ G(pb+l/2,J2%i \j+u),u>\<f> ~ 



0(1,^ + 1), where ft p = (X'X + T'^-^'y, Vp = <j 2 (X'X + T" 1 )- 1 , c* = (n+p + c Q )/2, 
d* = {(y - X/3)'(y - X/3) + /^T^ + d }/2, T = diag(n, . . . , t p ). 



As an alternative to MCMC and Laplace approximations |23|, a lower-bound on marginal like- 
lihoods may be obtained via variational methods [17] yielding approximate posterior distribu- 
tions on the model parameters. Using a similar approach to J5] Q], the approximate marginal 
posterior distributions of the parameters are given by (3 ~ A/"(/Xa, V^), a~ 2 ~ <?(c*,d*), 

£T0(a-l/2, 2(A,) ! (a- 2 )(/3|)), A, ~ 0(a+6, (r,) + (0)), - 0(p&+l/2, (w)+£* (A,-)), 



7" 



w - G(l, (0> + 1), where /^ = (/3> = (X'X + T-^^X'y, V^ = (<t- 2 ) _1 ( x ' x + T" 1 )" 1 , 
T- 1 =diag((rr 1 ),...,(r- 1 )),c* = (n + p + c )/2, d* = (y'y- 2y'X(/3) + E?=i x,(/3/3')x 4 + 
E^i^lXV 1 ) + rf °) A (/3/3'> = V;, + </3)</3'), (a- 2 ) = c*/d*, (A,) = (a + &)/(fo) + (</>)), 
(4>) = ( P b + l/2)/«w> + E?=i<Ai», (w) = l/«0) + 1) and 

((a- 2 )(/3 2 )) 1 / 2 K a+1/2 {(2(A,)( ( r- 2 )(/3 2 ))V2} 



<r) = 



(2(A,))V2 Ka _ 1/2 {(2(A J )(a- 2 )(/3 2 ))V2} 
(2(A J )) 1 /2K 3/2 _ a {(2(A J )(a- 2 )(/3 2 )) 1 /2} 



x ' ((a- 2 )(^ 2 ))V2 Kl/2 _ a {(2(A J )(a- 2 )(/3 2 ))V2}- 

This procedure consists of initializing the moments and iterating through them until some conver- 
gence criterion is reached. The deterministic nature of these approximations make them attractive 
as a quick alternative to MCMC. 

This conjugate modeling approach we have taken allows for a very straightforward implementation 
of Strawderman-Berger and horseshoe priors or, more generally, TPB normal scale mixture priors in 
regression models without the need for a more sophisticated sampling scheme which may ultimately 
attract more audiences towards the use of these more flexible and carefully defined normal scale 
mixture priors. 

4.2 Sparse Maximum a Posteriori Estimation 

Although not our main focus, many readers are interested in sparse solutions, hence we give the 
following brief discussion. Given a, b and (j>, maximum a posteriori (MAP) estimation is rather 
straightforward via a simple expectation-maximization (EM) procedure. This is accomplished in a 
similar manner to [8| by obtaining the joint MAP estimates of the error variance and the regression 
coefficients having taken the expectation with respect to the conditional posterior distribution of r _1 
using the second hierarchy given in Proposition 1 . The fcth expectation step then would consist of 
calculating 

, -n W = g° <~ 1/2 (! + ^r^ ^p{-^ (fc - 1) /(24_ 1) r J )}drr 1 



3 



joo t v 2+0(1 + T . m _ ia+b) ex p{-^( fc - 1 V(2af fc _ 1) r i )}dr7 1 



Oik Y\ o 

where p. ' and u, k _ l \ denote the modal estimates of the jth component of f3 and the error 



'(fc-i) 

variance a 2 at iteration (k — 1). The solution to (8|l may be expressed in terms of some special 
function(s) for changing values of a, b and <\>. b < 1 is a good choice as it will keep the tails of the 
marginal density on f3j heavy. A careful choice of a, on the other hand, is essential to sparse esti- 
mation. Admissible values of a for sparse estimation is apparent by the representation in Definition 
2, noting that for any a > 1, n(pj = 1) = 0, i.e. j3j may never be shrunk exactly to zero. Hence for 
sparse estimation, it is essential that < a < 1. Figure [2] (a) and (b) give the prior densities on pj 
for b = 1/2, (j> = 1 and a = {1/2, 1, 3/2} and the resulting marginal prior densities on /3j. These 
marginal densities are given by 

f 7fe e ^ /2r ( ^ 2 / 2 ) a = 1 /2 

*(&) = ] vfe " ¥^ /2 + § ^ /2 Erf(&A/2) a = l 

{ Jrfl-ie^m P 2 /2)} a = 3/2 

where Erf(.) denotes the error function and T(s, z) = J t s ~ 1 e~ t dt is the incomplete gamma 
function. Figure [2] clearly illustrates that while all three cases have very similar tail behavior, their 
behavior around the origin differ drastically. 





p 
(a) 



(b) 



Figure 2: Prior densities of (a) pj and (b) j3j for a — 1/2 (solid), a = 1 (dashed) and a = 3/2 (long 
dash). 



5 Experiments 



Throughout this section we use the Jeffreys' prior on the error precision by setting c = d = 0. We 
generate data for two cases, (n,p) — {(50, 20), (250, 100)}, from yi = x^/3* + e^, for i — 1, . . . , n 
where (3* is a p-vector that on average contains 20q non-zero elements which are indexed by the 
set A = {j : (3* y^ 0} for some random q E (0, 1). We randomize the procedure in the following 
manner: (i) C - W(p, I pxp ), (ii) x, - W(0, C), (iii) q ~ 6(1, 1) for the first and q ~ 6(1, 4) for 
the second cases, (iv) I(j € A) ~ Bernoulli^) for j = l,...,p where /(.) denotes the indicator 
function, (v) for j e A, fij ~ U{Q, 6) and for j <£ A, /3j = and finally (vi) e t ~ 7V(0, a 2 ) where 
a i-v W(0, 6). We generated 1000 data sets for each case resulting in a median signal-to-noise ratio 
of approximately 3.3 and 4.5. We obtain the estimate of the regression coefficients, (3, using the 
variational Bayes procedure and measure the performance by model error which is calculated as 
{(3* — (3)'C{(3* — (3). FigureBfa) and (b) display the median relative model error (RME) values 
(with their distributions obtained via bootstrapping) which is obtained by dividing the model error 
observed from our procedures by that of l\ regularization (lasso) tuned by 10-fold cross-validation. 
The boxplots in FigurepTa) and (b) correspond to different (a, b, cf>) values where C + signifies that </> 
is treated as unknown with a half-Cauchy prior as given earlier in Section 4. 1 . It is worth mentioning 
that we attain a clearly superior performance compared to the lasso, particularly in the second case, 
despite the fact that the estimator resulting from the variational Bayes procedure is not a thresholding 
rule. Note that 6=1 choice leads to much better performance under Case 2 than Case 1 . This is 
due to the fact that Case 2 involves a much sparser underlying setup on average than Case 1 and that 
the lighter tails attained by setting 6=1 leads to stronger shrinkage. 

To give a high dimensional example, we also generate a data set from the model i/i = x'^/3* + q, 
for i = 1, . . . , 100, where (3* is a 10000-dimensional very sparse vector with 10 randomly chosen 
components set to be 3, ti ~ A/"(0, 3 2 ) and Xij ~ A/"(0, 1) for j — 1, . . . ,p. This (3* choice leads 
to a signal-to-noise ratios of 3.16. For the particular data set we generated, the randomly chosen 
components of (3* to be non-zero were indexed by 1263, 2199, 2421, 4809, 5530, 7483, 7638, 7741, 
7891 and 8187. We set (a, 6, <j>) = (1, 1/2, 10" 4 ) which implies that a priori V( Pj > 0.5) = 0.99 
placing much more density in the neighborhood of pj = 1 (total shrinkage). This choice is due to the 
fact that n/p = 0.01 and to roughly reflect that we do not want any more than 100 predictors in the 
resulting model. Hence <\> is used, a priori, to limit the number of predictors in the model in relation 
to the sample size. Also note that with a = 1, the conditional posterior distribution of tJ is reduced 
to an inverse Gaussian. Since we are adjusting the global shrinkage parameter, <j), a priori, and it is 
chosen such that P(pj > 0.5) = 0.99, whether a — 1/2 or a = 1 should not matter. We first run 
the Gibbs sampler for 100000 iterations (2.4 hours on a computer with a 2.8 GHz CPU and 12 Gb 
of RAM using Mat 1 ab), discard the first 20000, thin the rest by picking every 5th sample to obtain 
the posteriors of the parameters. We observed that the chain converged by the 10000th iteration. 
For comparison purposes, we also ran the variational Bayes procedure using the values from the 
converged chain as the initial points (80 seconds). Figure |4] gives the posterior means attained by 
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Figure 3: Relative ME at different (a, b, </>) values for (a) Case 1 and (b) Case 2. 
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Figure 4: Posterior mean of /3 by sampling (square) and by approximate inference (circle). 



sampling and the variational approximation. The estimates corresponding to the zero elements of 
/3* are plotted with smaller shapes to prevent clutter. We see that in both cases the procedure is able 
to pick up the larger signals and shrink a significantly large portion of the rest towards zero. The 
approximate inference results are in accordance with the results from the Gibbs sampler. It should 
be noted that using a good informed guess on <\>, rather than treating it as an unknown in this high 
dimensional setting, improves the performance drastically. 

6 Discussion 

We conclude that the proposed hierarchical prior formulation constitutes a useful encompassing 
framework in understanding the behavior of different scale mixtures of normals and connecting 
them under a broader family of hierarchical priors. While t\ regularization, or namely lasso, 
arising from a double exponential prior in the Bayesian framework yields certain computational 
advantages, it demonstrates much inferior estimation performance relative to the more carefully 
formulated scale mixtures of normals. The proposed equivalence of the hierarchies in Proposi- 
tion 1 makes computation much easier for the TPB scale mixtures of normals. As per differ- 
ent choices of hyper-parameters, we recommend that a G (0,1] and b <G (0,1); in particular 
(a, b) = {(1/2, 1/2), (1, 1/2)}. These choices guarantee that the resulting prior has a kink at 
zero, which is essential for sparse estimation, and leads to heavy tails to avoid unnecessary bias 
in large signals (recall that a choice of b = 1/2 will yield Cauchy-like tails). In problems where 
oracle knowledge on sparsity exists or when p >> n, we recommend that <f> is fixed at a reasonable 
quantity to reflect an appropriate sparsity constraint as mentioned in Section 5. 
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